{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "#  MIND Utils Generation\n",
                "\n",
                "MIND dataset\\[1\\] is a large-scale English news dataset. It was collected from anonymized behavior logs of Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression.\n",
                "\n",
                "Many news recommendation methods use word embeddings, news vertical embeddings, news subvertical embeddings and user id embedding. Therefore, it is necessary to generate a word dictionary, a vertical dictionary, a subvertical dictionary and a `userid` dictionary to convert words, news verticals, subverticals and user ids from strings to indexes. To use the pretrain word embedding, an embedding matrix is generated as the initial weight of the word embedding layer.\n",
                "\n",
                "This notebook gives examples about how to generate:\n",
                "* `word_dict.pkl`: convert the words in news titles into indexes.\n",
                "* `word_dict_all.pkl`: convert the words in news titles and abstracts into indexes.\n",
                "* `embedding.npy`: pretrained word embedding matrix of words in word_dict.pkl\n",
                "* `embedding_all.npy`: pretrained embedding matrix of words in word_dict_all.pkl\n",
                "* `vert_dict.pkl`: convert news verticals into indexes.\n",
                "* `subvert_dict.pkl`: convert news subverticals into indexes.\n",
                "* `uid2index.pkl`: convert user ids into indexes."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "System version: 3.9.16 (main, May 15 2023, 23:46:34) \n",
                        "[GCC 11.2.0]\n"
                    ]
                }
            ],
            "source": [
                "import os\n",
                "import sys\n",
                "import numpy as np\n",
                "import pandas as pd\n",
                "from tqdm import tqdm\n",
                "import pickle\n",
                "from collections import Counter\n",
                "from tempfile import TemporaryDirectory\n",
                "\n",
                "from recommenders.datasets.mind import (download_mind,\n",
                "                                     extract_mind,\n",
                "                                     download_and_extract_glove,\n",
                "                                     load_glove_matrix,\n",
                "                                     word_tokenize\n",
                "                                    )\n",
                "from recommenders.datasets.download_utils import unzip_file\n",
                "from recommenders.utils.notebook_utils import store_metadata\n",
                "\n",
                "print(\"System version: {}\".format(sys.version))\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {
                "tags": [
                    "parameters"
                ]
            },
            "outputs": [],
            "source": [
                "# MIND sizes: \"demo\", \"small\" or \"large\"\n",
                "mind_type=\"demo\" \n",
                "# word_embedding_dim should be in [50, 100, 200, 300]\n",
                "word_embedding_dim = 300"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "100%|██████████████████████████████████████████████████████████████████████████| 17.0k/17.0k [00:05<00:00, 2.92kKB/s]\n",
                        "100%|██████████████████████████████████████████████████████████████████████████| 9.84k/9.84k [00:01<00:00, 6.80kKB/s]\n"
                    ]
                }
            ],
            "source": [
                "tmpdir = TemporaryDirectory()\n",
                "data_path = tmpdir.name\n",
                "train_zip, valid_zip = download_mind(size=mind_type, dest_path=data_path)\n",
                "unzip_file(train_zip, os.path.join(data_path, 'train'), clean_zip_file=False)\n",
                "unzip_file(valid_zip, os.path.join(data_path, 'valid'), clean_zip_file=False)\n",
                "output_path = os.path.join(data_path, 'utils')\n",
                "os.makedirs(output_path, exist_ok=True)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Prepare utils of news\n",
                "\n",
                "* word dictionary\n",
                "* vertical dictionary\n",
                "* subvetical dictionary"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [],
            "source": [
                "news = pd.read_table(os.path.join(data_path, 'train', 'news.tsv'),\n",
                "                     names=['newid', 'vertical', 'subvertical', 'title',\n",
                "                            'abstract', 'url', 'entities in title', 'entities in abstract'],\n",
                "                     usecols = ['vertical', 'subvertical', 'title', 'abstract'])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 5,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>vertical</th>\n",
                            "      <th>subvertical</th>\n",
                            "      <th>title</th>\n",
                            "      <th>abstract</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>lifestyle</td>\n",
                            "      <td>lifestyleroyals</td>\n",
                            "      <td>The Brands Queen Elizabeth, Prince Charles, an...</td>\n",
                            "      <td>Shop the notebooks, jackets, and more that the...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>news</td>\n",
                            "      <td>newsworld</td>\n",
                            "      <td>The Cost of Trump's Aid Freeze in the Trenches...</td>\n",
                            "      <td>Lt. Ivan Molchanets peeked over a parapet of s...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>health</td>\n",
                            "      <td>voices</td>\n",
                            "      <td>I Was An NBA Wife. Here's How It Affected My M...</td>\n",
                            "      <td>I felt like I was a fraud, and being an NBA wi...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>health</td>\n",
                            "      <td>medical</td>\n",
                            "      <td>How to Get Rid of Skin Tags, According to a De...</td>\n",
                            "      <td>They seem harmless, but there's a very good re...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>weather</td>\n",
                            "      <td>weathertopstories</td>\n",
                            "      <td>It's been Orlando's hottest October ever so fa...</td>\n",
                            "      <td>There won't be a chill down to your bones this...</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "    vertical        subvertical  \\\n",
                            "0  lifestyle    lifestyleroyals   \n",
                            "1       news          newsworld   \n",
                            "2     health             voices   \n",
                            "3     health            medical   \n",
                            "4    weather  weathertopstories   \n",
                            "\n",
                            "                                               title  \\\n",
                            "0  The Brands Queen Elizabeth, Prince Charles, an...   \n",
                            "1  The Cost of Trump's Aid Freeze in the Trenches...   \n",
                            "2  I Was An NBA Wife. Here's How It Affected My M...   \n",
                            "3  How to Get Rid of Skin Tags, According to a De...   \n",
                            "4  It's been Orlando's hottest October ever so fa...   \n",
                            "\n",
                            "                                            abstract  \n",
                            "0  Shop the notebooks, jackets, and more that the...  \n",
                            "1  Lt. Ivan Molchanets peeked over a parapet of s...  \n",
                            "2  I felt like I was a fraud, and being an NBA wi...  \n",
                            "3  They seem harmless, but there's a very good re...  \n",
                            "4  There won't be a chill down to your bones this...  "
                        ]
                    },
                    "execution_count": 5,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "news.head()"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [],
            "source": [
                "news_vertical = news.vertical.drop_duplicates().reset_index(drop=True)\n",
                "vert_dict_inv = news_vertical.to_dict()\n",
                "vert_dict = {v: k+1 for k, v in vert_dict_inv.items()}\n",
                "\n",
                "news_subvertical = news.subvertical.drop_duplicates().reset_index(drop=True)\n",
                "subvert_dict_inv = news_subvertical.to_dict()\n",
                "subvert_dict = {v: k+1 for k, v in vert_dict_inv.items()}"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [],
            "source": [
                "news.title = news.title.apply(word_tokenize)\n",
                "news.abstract = news.abstract.apply(word_tokenize)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "100%|████████████████████████████████████████████████████████████████████████| 26740/26740 [00:02<00:00, 9093.49it/s]\n"
                    ]
                }
            ],
            "source": [
                "word_cnt = Counter()\n",
                "word_cnt_all = Counter()\n",
                "\n",
                "for i in tqdm(range(len(news))):\n",
                "    word_cnt.update(news.loc[i]['title'])\n",
                "    word_cnt_all.update(news.loc[i]['title'])\n",
                "    word_cnt_all.update(news.loc[i]['abstract'])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [],
            "source": [
                "word_dict = {k: v+1 for k, v in zip(word_cnt, range(len(word_cnt)))}\n",
                "word_dict_all = {k: v+1 for k, v in zip(word_cnt_all, range(len(word_cnt_all)))}"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [],
            "source": [
                "with open(os.path.join(output_path, 'vert_dict.pkl'), 'wb') as f:\n",
                "    pickle.dump(vert_dict, f)\n",
                "    \n",
                "with open(os.path.join(output_path, 'subvert_dict.pkl'), 'wb') as f:\n",
                "    pickle.dump(subvert_dict, f)\n",
                "\n",
                "with open(os.path.join(output_path, 'word_dict.pkl'), 'wb') as f:\n",
                "    pickle.dump(word_dict, f)\n",
                "    \n",
                "with open(os.path.join(output_path, 'word_dict_all.pkl'), 'wb') as f:\n",
                "    pickle.dump(word_dict_all, f)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Prepare embedding matrixs\n",
                "* embedding.npy\n",
                "* embedding_all.npy"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "100%|████████████████████████████████████████████████████████████████████████████| 842k/842k [02:45<00:00, 5.08kKB/s]\n"
                    ]
                }
            ],
            "source": [
                "glove_path = download_and_extract_glove(data_path)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "400000it [00:06, 60728.10it/s]\n",
                        "400000it [00:07, 50299.10it/s]\n"
                    ]
                }
            ],
            "source": [
                "embedding_matrix, exist_word = load_glove_matrix(glove_path, word_dict, word_embedding_dim)\n",
                "embedding_all_matrix, exist_all_word = load_glove_matrix(glove_path, word_dict_all, word_embedding_dim)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 13,
            "metadata": {},
            "outputs": [],
            "source": [
                "np.save(os.path.join(output_path, 'embedding.npy'), embedding_matrix)\n",
                "np.save(os.path.join(output_path, 'embedding_all.npy'), embedding_all_matrix)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Prepare uid2index.pkl"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 14,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "22034it [00:00, 89146.42it/s]\n"
                    ]
                }
            ],
            "source": [
                "uid2index = {}\n",
                "\n",
                "with open(os.path.join(data_path, 'train', 'behaviors.tsv'), 'r') as f:\n",
                "    for l in tqdm(f):\n",
                "        uid = l.strip('\\n').split('\\t')[1]\n",
                "        if uid not in uid2index:\n",
                "            uid2index[uid] = len(uid2index) + 1"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 15,
            "metadata": {},
            "outputs": [],
            "source": [
                "with open(os.path.join(output_path, 'uid2index.pkl'), 'wb') as f:\n",
                "    pickle.dump(uid2index, f)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 19,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "{'vert_num': 17,\n",
                            " 'subvert_num': 17,\n",
                            " 'word_num': 23404,\n",
                            " 'word_num_all': 41074,\n",
                            " 'embedding_exist_num': 22408,\n",
                            " 'embedding_exist_num_all': 37634,\n",
                            " 'uid2index': 5000}"
                        ]
                    },
                    "execution_count": 19,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "utils_state = {\n",
                "    'vert_num': len(vert_dict),\n",
                "    'subvert_num': len(subvert_dict),\n",
                "    'word_num': len(word_dict),\n",
                "    'word_num_all': len(word_dict_all),\n",
                "    'embedding_exist_num': len(exist_word),\n",
                "    'embedding_exist_num_all': len(exist_all_word),\n",
                "    'uid2index': len(uid2index)\n",
                "}\n",
                "utils_state"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 17,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "application/scrapbook.scrap.json+json": {
                            "data": {
                                "embedding_exist_num": 22408,
                                "embedding_exist_num_all": 37634,
                                "subvert_num": 17,
                                "uid2index": 5000,
                                "vert_num": 17,
                                "word_num": 23404,
                                "word_num_all": 41074
                            },
                            "encoder": "json",
                            "name": "utils_state",
                            "version": 1
                        }
                    },
                    "metadata": {
                        "scrapbook": {
                            "data": true,
                            "display": false,
                            "name": "utils_state"
                        }
                    },
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# Record results for tests - ignore this cell\n",
                "store_metadata(\"vert_num\", len(vert_dict))\n",
                "store_metadata(\"subvert_num\", len(subvert_dict))\n",
                "store_metadata(\"word_num\", len(word_dict))\n",
                "store_metadata(\"word_num_all\", len(word_dict_all))\n",
                "store_metadata(\"embedding_exist_num\", len(exist_word))\n",
                "store_metadata(\"embedding_exist_num_all\", len(exist_all_word))\n",
                "store_metadata(\"uid2index\", len(uid2index))"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 18,
            "metadata": {},
            "outputs": [],
            "source": [
                "tmpdir.cleanup()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## References\n",
                "\n",
                "\\[1\\] Wu, Fangzhao, et al. \"MIND: A Large-scale Dataset for News Recommendation\" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>"
            ]
        }
    ],
    "metadata": {
        "celltoolbar": "Tags",
        "kernelspec": {
            "display_name": "Python (recommenders)",
            "language": "python",
            "name": "recommenders"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.9.16"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}
